set.seed(1)
x<-matrix(rnorm(200*2),ncol=2)      # 200 observations in two dimensions
x[1:100,]<-x[1:100,]+2              # shift the first 100 rows by +2
x[101:150,]<-x[101:150,]-2          # shift rows 101-150 by -2
y<-c(rep(1,150),rep(2,50))          # class 1: rows 1-150, class 2: rows 151-200
dat<-data.frame(x=x,y=as.factor(y)) # encode the response as a factor
plot(x,col=y)
The plot shows that the two classes are not linearly separable. We will now fit a support vector classifier.
library(e1071)
svmfit<-svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE)
plot(svmfit,dat)
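Before moving on, it can help to inspect the fitted object; in e1071, `summary()` reports the kernel, cost, and number of support vectors, and the `$index` slot holds their row indices:

```r
summary(svmfit)      # kernel, cost, and support-vector counts per class
svmfit$index         # which rows of dat are support vectors
length(svmfit$index) # total number of support vectors
```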
Use tune() to perform ten-fold cross-validation over a grid of cost values and store the best model.
set.seed(1)
tune.out<-tune(svm,y~.,data=dat,kernel="linear",ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100)))
bestmod<-tune.out$best.model
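Before committing to the best model, it is worth examining the cross-validation error across the cost grid; the tune object stores the full results in its `performances` slot:

```r
summary(tune.out)        # CV error and dispersion for each cost
tune.out$performances    # the same results as a data frame
tune.out$best.parameters # the cost with the lowest CV error
```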
Now generate test observations the same way as the training data and predict their class labels.
xtest<-matrix(rnorm(200*2),ncol=2)
xtest[1:100,]<-xtest[1:100,]+2     # mirror the training-set shifts
xtest[101:150,]<-xtest[101:150,]-2
ytest<-c(rep(1,150),rep(2,50))     # same class structure as the training set
testdat<-data.frame(x=xtest,y=as.factor(ytest))
ypred<-predict(bestmod,testdat)
table(predict=ypred,truth=testdat$y)
## truth
## predict 1 2
## 1 80 120
## 2 0 0
In this case, 60% of the test observations are misclassified; the linear decision boundary handles this data poorly.
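The misclassification rate can also be computed directly rather than read off the table (a minimal sketch, assuming the ypred and testdat objects from above):

```r
# fraction of test observations whose predicted label differs from the truth
mean(ypred != testdat$y)
```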
Moving on to the support vector machine. First, split the data into training and test halves, then fit with a radial kernel.
set.seed(1)
train<-sample(200,100) # indices of the 100 training observations
svmfit<-svm(y~.,data=dat[train,],kernel="radial",gamma=1,cost=1)
plot(svmfit,dat[train,])
The plot shows a distinctly non-linear decision boundary. Now let's tune cost and gamma.
set.seed(1)
tune.out<-tune(svm,y~.,data=dat[train,],kernel="radial",ranges=list(cost=c(0.1,1,10,100,1000),gamma=c(0.5,1,2,3,4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 1
##
## - best performance: 0.08
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-01 0.5 0.23 0.10593499
## 2 1e+00 0.5 0.09 0.09944289
## 3 1e+01 0.5 0.09 0.09944289
## 4 1e+02 0.5 0.11 0.11005049
## 5 1e+03 0.5 0.10 0.10540926
## 6 1e-01 1.0 0.23 0.10593499
## 7 1e+00 1.0 0.08 0.09189366
## 8 1e+01 1.0 0.11 0.11005049
## 9 1e+02 1.0 0.12 0.10327956
## 10 1e+03 1.0 0.13 0.11595018
## 11 1e-01 2.0 0.23 0.10593499
## 12 1e+00 2.0 0.08 0.09189366
## 13 1e+01 2.0 0.11 0.11005049
## 14 1e+02 2.0 0.13 0.11595018
## 15 1e+03 2.0 0.16 0.10749677
## 16 1e-01 3.0 0.23 0.10593499
## 17 1e+00 3.0 0.08 0.09189366
## 18 1e+01 3.0 0.12 0.12292726
## 19 1e+02 3.0 0.14 0.09660918
## 20 1e+03 3.0 0.15 0.10801234
## 21 1e-01 4.0 0.23 0.10593499
## 22 1e+00 4.0 0.08 0.09189366
## 23 1e+01 4.0 0.11 0.11972190
## 24 1e+02 4.0 0.15 0.10801234
## 25 1e+03 4.0 0.15 0.10801234
Cross-validation selects cost = 1 and gamma = 1 (CV error 0.08). Now time to predict!
table(true=dat[-train,"y"],pred=predict(tune.out$best.model,newdata=dat[-train,]))
## pred
## true 1 2
## 1 57 16
## 2 20 7
In this case, (16 + 20)/100 = 36% of the test observations are misclassified. The non-linear decision boundary from the SVM still outperforms the linear support vector classifier, which is expected because the true class boundary is non-linear.
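The test error rate for the tuned radial-kernel fit can be computed the same way (a sketch, assuming the tune.out and train objects from above):

```r
svm.pred <- predict(tune.out$best.model, newdata = dat[-train, ])
mean(svm.pred != dat[-train, "y"]) # test misclassification rate
```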